85 research outputs found
Feature-based method for document alignment in comparable news corpora
In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It contributes 4.1% and 8% to performance improvement over Pearson’s correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of 23.2% and 15.3% on the two sets of bilingual corpora compared with a prior information retrieval-based method.
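The abstract does not say how the Discrete Fourier Transform-based feature is computed, so the following is only an illustrative sketch under stated assumptions: each term is represented by a series of per-day frequencies, and two series are compared through the cosine similarity of their DFT magnitude spectra, alongside a plain Pearson correlation for contrast. The function names and the toy series are hypothetical and are not taken from the paper.

```python
import numpy as np


def pearson_similarity(freq_a, freq_b):
    """Pearson correlation between two equal-length frequency series."""
    a = np.asarray(freq_a, dtype=float)
    b = np.asarray(freq_b, dtype=float)
    # A constant series has no defined correlation; treat it as no similarity.
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])


def dft_similarity(freq_a, freq_b):
    """Cosine similarity between DFT magnitude spectra (assumed feature form)."""
    a = np.asarray(freq_a, dtype=float)
    b = np.asarray(freq_b, dtype=float)
    # rfft returns the non-redundant half of the spectrum for real input.
    spec_a = np.abs(np.fft.rfft(a))
    spec_b = np.abs(np.fft.rfft(b))
    denom = np.linalg.norm(spec_a) * np.linalg.norm(spec_b)
    if denom == 0:
        return 0.0
    return float(spec_a @ spec_b / denom)


# Hypothetical daily counts of a term and of its translation, where the
# second corpus reports the same burst two days later (circular shift).
term_en = np.array([0, 1, 5, 2, 0, 0, 0, 0])
term_zh = np.roll(term_en, 2)

pearson_score = pearson_similarity(term_en, term_zh)
dft_score = dft_similarity(term_en, term_zh)
print(pearson_score, dft_score)
```

In this toy case the magnitude spectrum is unchanged by the circular shift, so the spectral score stays near 1 while the Pearson score drops. That is one plausible reason a frequency-domain feature could help with news stories reported on different days, though the paper may define its feature differently.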
Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
This study investigates machine translation between related languages, i.e.,
languages within the same family that share linguistic characteristics such as
word order and lexical similarity. Machine translation through few-shot
prompting leverages a small set of translation pair examples to generate
translations for test sentences. This procedure requires the model to learn how
to generate translations while simultaneously ensuring that token ordering is
maintained to produce a fluent and accurate translation. We propose that for
related languages, the task of machine translation can be simplified by
leveraging the monotonic alignment characteristic of such languages. We
introduce DecoMT, a novel approach of few-shot prompting that decomposes the
translation process into a sequence of word chunk translations. Through
automatic and human evaluation conducted on multiple related language pairs
across various language families, we demonstrate that our proposed approach of
decomposed prompting surpasses multiple established few-shot baseline
approaches. For example, DecoMT outperforms the strong few-shot prompting BLOOM
model with an average improvement of 8 chrF++ scores across the examined
languages.
Comment: EMNLP 2023 (Main, Long paper)
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
We present SeaEval, a benchmark for multilingual foundation models. In
addition to characterizing how these models understand and reason with natural
language, we also investigate how well they comprehend cultural practices,
nuances, and values. Alongside standard accuracy metrics, we investigate the
brittleness of foundation models in the dimensions of semantics and
multilinguality. Our analyses span both open-sourced and closed models, leading
to empirical results across classic NLP tasks, reasoning, and cultural
comprehension. Key findings indicate (1) Most models exhibit varied behavior
when given paraphrased instructions. (2) Many models still suffer from exposure
bias (e.g., positional bias, majority label bias). (3) For questions rooted in
factual, scientific, and commonsense knowledge, consistent responses are
expected across multilingual queries that are semantically equivalent. Yet,
most models surprisingly demonstrate inconsistent performance on these queries.
(4) Multilingually-trained models have not attained "balanced multilingual"
capabilities. Our endeavors underscore the need for more generalizable semantic
representations and enhanced multilingual contextualization. SeaEval can serve
as a launchpad for more thorough investigations and evaluations for
multilingual and multicultural scenarios.
Comment: 15 pages, 7 figures
Measurement of the inclusive and dijet cross-sections of b-jets in pp collisions at sqrt(s) = 7 TeV with the ATLAS detector
The inclusive and dijet production cross-sections have been measured for jets
containing b-hadrons (b-jets) in proton-proton collisions at a centre-of-mass
energy of sqrt(s) = 7 TeV, using the ATLAS detector at the LHC. The
measurements use data corresponding to an integrated luminosity of 34 pb^-1.
The b-jets are identified using either a lifetime-based method, where secondary
decay vertices of b-hadrons in jets are reconstructed using information from
the tracking detectors, or a muon-based method where the presence of a muon is
used to identify semileptonic decays of b-hadrons inside jets. The inclusive
b-jet cross-section is measured as a function of transverse momentum in the
range 20 < pT < 400 GeV and rapidity in the range |y| < 2.1. The bbbar-dijet
cross-section is measured as a function of the dijet invariant mass in the
range 110 < m_jj < 760 GeV, the azimuthal angle difference between the two jets
and the angular variable chi in two dijet mass regions. The results are
compared with next-to-leading-order QCD predictions. Good agreement is observed
between the measured cross-sections and the predictions obtained using POWHEG +
Pythia. MC@NLO + Herwig shows good agreement with the measured bbbar-dijet
cross-section. However, it does not reproduce the measured inclusive
cross-section well, particularly for central b-jets with large transverse
momenta.
Comment: 10 pages plus author list (21 pages total), 8 figures, 1 table, final version published in European Physical Journal
Systematic Review of Potential Health Risks Posed by Pharmaceutical, Occupational and Consumer Exposures to Metallic and Nanoscale Aluminum, Aluminum Oxides, Aluminum Hydroxide and Its Soluble Salts
Aluminum (Al) is a ubiquitous substance encountered both naturally (as the third most abundant element) and intentionally (used in water, foods, pharmaceuticals, and vaccines); it is also present in ambient and occupational airborne particulates. Existing data underscore the importance of Al physical and chemical forms in relation to its uptake, accumulation, and systemic bioavailability. The present review represents a systematic examination of the peer-reviewed literature on the adverse health effects of Al materials published since a previous critical evaluation compiled by Krewski et al. (2007).
Challenges encountered in carrying out the present review reflected the experimental use of different physical and chemical Al forms, different routes of administration, and different target organs in relation to the magnitude, frequency, and duration of exposure. Wide variations in diet can result in Al intakes that are often higher than the World Health Organization provisional tolerable weekly intake (PTWI), which is based on studies with Al citrate. Comparing daily dietary Al exposures on the basis of “total Al” assumes that gastrointestinal bioavailability for all dietary Al forms is equivalent to that for Al citrate, an approach that requires validation. Current occupational exposure limits (OELs) for identical Al substances vary as much as 15-fold.
The toxicity of different Al forms depends in large measure on their physical behavior and relative solubility in water. The toxicity of soluble Al forms depends upon the delivered dose of Al+3 to target tissues. Trivalent Al reacts with water to produce bidentate superoxide coordination spheres [Al(O2)(H2O4)+2 and Al(H2O)6+3] that, after complexation with O2•−, generate Al superoxides [Al(O2•)](H2O5)]+2. Semireduced AlO2• radicals deplete mitochondrial Fe and promote generation of H2O2, O2•− and OH•. Thus, it is the Al+3-induced formation of oxygen radicals that accounts for the oxidative damage that leads to intrinsic apoptosis. In contrast, the toxicity of the insoluble Al oxides depends primarily on their behavior as particulates.
Aluminum has been held responsible for human morbidity and mortality, but there is no consistent and convincing evidence to associate the Al found in food and drinking water at the doses and chemical forms presently consumed by people living in North America and Western Europe with increased risk for Alzheimer's disease (AD). Neither is there clear evidence to show use of Al-containing underarm antiperspirants or cosmetics increases the risk of AD or breast cancer. Metallic Al, its oxides, and common Al salts have not been shown to be either genotoxic or carcinogenic. Aluminum exposures during neonatal and pediatric parenteral nutrition (PN) can impair bone mineralization and delay neurological development. Adverse effects to vaccines with Al adjuvants have occurred; however, recent controlled trials found that the immunologic response to certain vaccines with Al adjuvants was no greater, and in some cases less than, that after identical vaccination without Al adjuvants.
The scientific literature on the adverse health effects of Al is extensive. Health risk assessments for Al must take into account individual co-factors (e.g., age, renal function, diet, gastric pH). Conclusions from the current review point to the need for refinement of the PTWI, reduction of Al contamination in PN solutions, justification for routine addition of Al to vaccines, and harmonization of OELs for Al substances.
Measurement of charged-particle event shape variables in inclusive sqrt(s) = 7 TeV proton-proton interactions with the ATLAS detector
The measurement of charged-particle event shape variables is presented in inclusive inelastic pp collisions at a center-of-mass energy of 7 TeV using the ATLAS detector at the LHC. The observables studied are the transverse thrust, thrust minor, and transverse sphericity, each defined using the final-state charged particles' momentum components perpendicular to the beam direction. Events with at least six charged particles are selected by a minimum-bias trigger. In addition to the differential distributions, the evolution of each event shape variable as a function of the leading charged-particle transverse momentum, charged-particle multiplicity, and summed transverse momentum is presented. Predictions from several Monte Carlo models show significant deviations from data.
Pan-cancer analysis of whole genomes
Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale(1-3). Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4-5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter(4); identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation(5,6); analyses timings and patterns of tumour evolution(7); describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity(8,9); and evaluates a range of more-specialized features of cancer genomes(8,10-18).
Peer reviewed